Introduction

In this project, we will explore the financial contributions to 2016 US Presidential candidates of California (CA). We try to unconver the statistics in contribution data, as well as its relationship with the voters’ background and election result of CA usning exploratory data analysis techiniques.

The data set contains about 1M contributions records from 2014 to 2016 of CA, it contains 58 counties in CA, 25 condidates of Democrat, Republican and other parties.

Data Wrangling

In this section, we audit, clean and manupulate data to preapre for later exploration. Data will be firstly loaded and preprocessed sperately, then several dataframes will be built and used in the following sections.

Data Preprocessing

We will load and take glimps of the data rows. After it the data will be preprocessed by dropping unused part, renaming columns, covnertting data types, and formatting.

Financial contribution data

Data structure
## 'data.frame':    1125660 obs. of  19 variables:
##  $ V1 : Factor w/ 26 levels "C00458844","C00500587",..: 26 6 6 6 7 7 7 7 6 7 ...
##  $ V2 : Factor w/ 26 levels "cand_id","P00003392",..: 1 2 2 2 13 13 13 13 2 13 ...
##  $ V3 : Factor w/ 26 levels "Bush, Jeb","cand_nm",..: 2 5 5 5 21 21 21 21 5 21 ...
##  $ V4 : Factor w/ 195944 levels " ALERIS, ANNAKIM",..: 33897 6555 26622 59281 99849 101095 101095 101122 78047 101139 ...
##  $ V5 : Factor w/ 2118 levels "","-4086",".",..: 399 943 260 607 259 1505 1505 2003 907 2055 ...
##  $ V6 : Factor w/ 2 levels "CA","contbr_st": 2 1 1 1 1 1 1 1 1 1 ...
##  $ V7 : Factor w/ 134035 levels "","00000","000090272",..: 134033 107295 66913 50011 61864 16081 16081 42612 54660 57711 ...
##  $ V8 : Factor w/ 57984 levels ""," APPLE INC.",..: 12079 NA NA NA 3977 54916 54916 36582 NA 35437 ...
##  $ V9 : Factor w/ 25387 levels ""," ATTORNEY",..: 4972 18993 18993 18993 21225 15921 15921 17407 18993 14744 ...
##  $ V10: Factor w/ 9578 levels "-.01","-.24",..: 9578 7397 4167 7348 6763 6333 1478 5112 6763 1431 ...
##  $ V11: Factor w/ 660 levels "01-APR-15","01-APR-16",..: 660 540 408 25 78 99 121 78 408 99 ...
##  $ V12: Factor w/ 75 levels "","* EARMARKED CONTRIBUTION: SEE BELOW REATTRIBUTION/REFUND PENDING",..: 56 1 1 1 1 1 1 1 1 1 ...
##  $ V13: Factor w/ 3 levels "","memo_cd","X": 2 3 3 3 1 1 1 1 3 1 ...
##  $ V14: Factor w/ 424 levels "","*","* $550 REFUNDED 6/16/16",..: 273 39 39 39 4 4 4 4 39 4 ...
##  $ V15: Factor w/ 4 levels "form_tp","SA17A",..: 1 3 3 3 2 2 2 2 3 2 ...
##  $ V16: Factor w/ 187 levels "1003942","1004025",..: 187 141 141 141 106 106 106 106 141 106 ...
##  $ V17: Factor w/ 1122206 levels "A000771210424405B8CF",..: 714265 327338 326620 324002 858678 860122 862423 858140 326658 860118 ...
##  $ V18: Factor w/ 5 levels "","election_tp",..: 2 4 4 4 4 4 4 4 4 4 ...
##  $ V19: logi  NA NA NA NA NA NA ...
Data cleaning
  1. drop unused data
  • only keep candidate name, contribution city, zip, amount, date, contributor's employer and occupation
  • rename column names for consistence of all dataframes
  1. modify data type
  • date from string to date
  • amount and zip to numbers
  • formats of city, candidate, employer, occupation
  1. auditing
  • check for invalid values such as N/A

  • load and preprocess data

  • check state equals to CA

  • check number of NA in each columns
  • for employer we only care the top ranks, minor numbers of NA will be ignored
  • for occupation we will decide if further cleaning is necessary after more exploration

##  candidate       city        zip   employer occupation     amount       date 
##          0          1        102          0          0          0          0

Average family income by county

We will use the family income data to gain more background of the contributors and voters in CA. The data comes from US 2014 census. The preprocessing follow similar steps as contribution data.

Data structure
## 'data.frame':    60 obs. of  9 variables:
##  $ GEO.id                  : Factor w/ 2 levels "0400000US06",..: 2 1 1 1 1 1 1 1 1 1 ...
##  $ GEO.id2                 : Factor w/ 2 levels "06","Id2": 2 1 1 1 1 1 1 1 1 1 ...
##  $ GEO.display.label       : Factor w/ 2 levels "California","Geography": 2 1 1 1 1 1 1 1 1 1 ...
##  $ GCT_STUB.target.geo.id  : Factor w/ 60 levels "0400000US06",..: 60 1 2 3 4 5 6 7 8 9 ...
##  $ GCT_STUB.target.geo.id2 : Factor w/ 60 levels "06","06001","06003",..: 60 1 2 3 4 5 6 7 8 9 ...
##  $ GCT_STUB.display.label  : Factor w/ 60 levels "California","California - Alameda County",..: 60 1 2 3 4 5 6 7 8 9 ...
##  $ GCT_STUB.display.label.1: Factor w/ 60 levels "Alameda County",..: 12 6 1 2 3 4 5 7 8 9 ...
##  $ HC01                    : Factor w/ 60 levels "106401","108088",..: 60 41 57 43 38 24 37 23 59 17 ...
##  $ HC02                    : Factor w/ 60 levels "1029","1030",..: 60 28 57 5 46 17 44 39 4 42 ...
Data cleaning
  • drop unused columns except for county name and family_income
  • convert income to numeric value
##      county family_income
## 3   Alameda         90822
## 4    Alpine         71932
## 5    Amador         68765
## 6     Butte         56934
## 7 Calaveras         67100
## 8    Colusa         56472

Votes of presendtial election

Data structure
## 'data.frame':    58 obs. of  4 variables:
##  $ Vote.by.county: Factor w/ 58 levels "Alameda","Alpine",..: 19 30 37 33 36 43 1 34 7 38 ...
##  $ Clinton       : Factor w/ 58 levels "1,601,382","1,614",..: 1 35 34 24 22 31 29 13 20 23 ...
##  $ Trump         : Factor w/ 58 levels "1,030","1,989",..: 46 31 24 22 15 56 49 8 51 23 ...
##  $ Rpt.          : Factor w/ 1 level "100%": 1 1 1 1 1 1 1 1 1 1 ...
Data cleaning
  • drop unused columns except for county name and votes for Clinton and votes for Trump
  • convert votes to integers
##           county Clinton  Trump
## 1    Los Angeles 1601382 525308
## 2         Orange  395801 356892
## 3      San Diego  390531 269422
## 4      Riverside  228731 219359
## 5 San Bernardino  220853 188080
## 6    Santa Clara  312994  91990

Presendential Candidates

  • the candidates data has been preprocessed
## 'data.frame':    25 obs. of  3 variables:
##  $ candidate: Factor w/ 25 levels "Bush","Carson",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ full.name: Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ party    : Factor w/ 3 levels "Democrat","Other",..: 3 3 3 1 3 3 3 3 3 3 ...
  • convert items to character type
##   candidate                 full.name      party
## 1      Bush                 Bush, Jeb Republican
## 2    Carson       Carson, Benjamin S. Republican
## 3  Christie  Christie, Christopher J. Republican
## 4   Clinton   Clinton, Hillary Rodham   Democrat
## 5      Cruz Cruz, Rafael Edward 'Ted' Republican
## 6   Fiorina            Fiorina, Carly Republican

City and County Data

  • we need them to map city name or zip to county
  • the population by county will also be used
Data Structure
##     zip       city state latitude longitude  fips
## 1 00210 Portsmouth    NH  43.0059  -71.0132 33015
## 2 00211 Portsmouth    NH  43.0059  -71.0132 33015
## 3 00212 Portsmouth    NH  43.0059  -71.0132 33015
## 4 00213 Portsmouth    NH  43.0059  -71.0132 33015
## 5 00214 Portsmouth    NH  43.0059  -71.0132 33015
## 6 00215 Portsmouth    NH  43.0059  -71.0132 33015
##      county_name state state_fips county_fips fips_class  CSA  CBSA population
## 1 Autauga County    AL         01         001         H1 <NA> 33860      54571
## 2 Baldwin County    AL         01         003         H1  380 19300     182265
## 3 Barbour County    AL         01         005         H1 <NA>  <NA>      27457
## 4    Bibb County    AL         01         007         H1  142 13820      22915
## 5  Blount County    AL         01         009         H1  142 13820      57322
## 6 Bullock County    AL         01         011         H1 <NA>  <NA>      10914
Combing data
  • we will keep CA data in zip_codes and counties
  • then merge county information to zip_codes using county identifier fips
##         zip        city state latitude longitude fips
## 39141 89439       Verdi    CA 39.52369 -120.0301 6091
## 39206 90001 Los Angeles    CA 33.97291 -118.2488 6037
## 39207 90002 Los Angeles    CA 33.94832 -118.2485 6037
## 39208 90003 Los Angeles    CA 33.96271 -118.2760 6037
## 39209 90004 Los Angeles    CA 34.07711 -118.3076 6037
## 39210 90005 Los Angeles    CA 34.05891 -118.3085 6037

Data Manupulation

Merge dataframes

  • merge income, counties to votes, as they all belong to information of each county
##     county             Clinton            Trump        family_income   
##  Length:58          Min.   :    318   Min.   :   211   Min.   : 45296  
##  Class :character   1st Qu.:   5149   1st Qu.:  5643   1st Qu.: 52789  
##  Mode  :character   Median :  21155   Median : 14604   Median : 62945  
##                     Mean   :  94515   Mean   : 51133   Mean   : 66744  
##                     3rd Qu.:  73545   3rd Qu.: 49241   3rd Qu.: 75677  
##                     Max.   :1601382   Max.   :525308   Max.   :120030  
##    population     
##  Min.   :   1175  
##  1st Qu.:  48001  
##  Median : 179140  
##  Mean   : 642310  
##  3rd Qu.: 642593  
##  Max.   :9818605
##      county Clinton Trump family_income population
## 1   Alameda  300185 59384         90822    1510271
## 2    Alpine     318   211         71932       1175
## 3    Amador    3420  6069         68765      38091
## 4     Butte   31638 35354         56934     220000
## 5 Calaveras    6225 10126         67100      45578
## 6    Colusa    1832  2709         56472      21419

Map city to county

  • get the map between city to county and store it in zip_codes
  • it will be used as a bridge between contributions and votes or counties
##     zip      city  county
## 1 94604   Oakland Alameda
## 2 94551 Livermore Alameda
## 3 94542   Hayward Alameda
## 4 94621   Oakland Alameda
## 5 94557   Hayward Alameda
## 6 94623   Oakland Alameda

Map contribution city to county

Audit
  • we can use zip code or city name for the mapping between zip_codes and contributions
  • There are four types of error about columns county_by_zip and county_by_city
  1. Only county_by_zip is NA, then we believe it’s valid and will fix city by looking up the right zip in zip_codes
  2. Only county_by_city is valid, similar to type 1
  3. Both columns are not NA, then we lack the information about the truth, we will keep the rows except for statistics when county is involved
  4. Both columns are NA, it should be dropped as invalid data
##     candidate                  city   zip
## 5     Sanders             Camarillo 93011
## 39    Clinton             Palo Alto 94301
## 105      Cruz          Kagel Canyon 91342
## 194   Clinton           Playa Vista 90094
## 201   Sanders                Goleta 93116
## 255   Sanders             Palo Alto 94301
## 310   Sanders             Palo Alto 94303
## 338   Clinton             Montecito 93108
## 355   Sanders             Palo Alto 94303
## 360   Clinton Rolling Hills Estates 90274
##                                   employer              occupation amount
## 5                At&T Government Solutions       Software Engineer     40
## 39                                      NA                 Retired    500
## 105               City Of Huntington Beach           Civil Servant     35
## 194                                     NA                 Retired    300
## 201                                Impulse        Network Engineer    100
## 255                     Ca Dept Of Justice                Attorney     27
## 310           Palo Alto Medical Foundation Radiologic Technologist     15
## 338 Manageamerica Integrated Billing Servi          Business Owner    250
## 355                          Self Employed           Self Employed     50
## 360                                Retired           Former Banker    100
##           date county_by_zip county_by_city
## 5   2016-03-04          <NA>        Ventura
## 39  2016-04-15   Santa Clara      San Mateo
## 105 2016-04-05   Los Angeles           <NA>
## 194 2016-04-04   Los Angeles           <NA>
## 201 2016-03-06          <NA>  Santa Barbara
## 255 2016-03-04   Santa Clara      San Mateo
## 310 2016-03-06   Santa Clara      San Mateo
## 338 2016-04-07 Santa Barbara           <NA>
## 355 2016-03-06   Santa Clara      San Mateo
## 360 2016-04-04   Los Angeles           <NA>
  • Total error number of county entry
## [1] 44016
  • Number of conflicting county entry
## [1] 13999
  • Number of invalid county entry
## [1] 247
Cleaning Contribution
  • We refer zip_codes as the standard data source, and will try to indentify and fix errors in county column of countributions table by combining two county columns.
  • fix the first two types of error

  • then final number of invalid county entries

## [1] 14246
  • invalid county entries examples
##     candidate             city   zip                     employer
## 39    Clinton        Palo Alto 94301                           NA
## 255   Sanders        Palo Alto 94301           Ca Dept Of Justice
## 310   Sanders        Palo Alto 94303 Palo Alto Medical Foundation
## 355   Sanders        Palo Alto 94303                Self Employed
## 453   Clinton Westlake Village 91361                      Retired
## 602   Clinton            Tracy 95377      Mental Health Counselor
##                  occupation amount       date county_by_zip county_by_city
## 39                  Retired    500 2016-04-15   Santa Clara      San Mateo
## 255                Attorney     27 2016-03-04   Santa Clara      San Mateo
## 310 Radiologic Technologist     15 2016-03-06   Santa Clara      San Mateo
## 355           Self Employed     50 2016-03-06   Santa Clara      San Mateo
## 453                 Teacher     10 2016-04-24       Ventura    Los Angeles
## 602                 Retired     10 2016-04-29   San Joaquin        Alameda
##     county
## 39    <NA>
## 255   <NA>
## 310   <NA>
## 355   <NA>
## 453   <NA>
## 602   <NA>
  • unused columns such as city, zip can be dropped

Merge candidates, votes to contributions

  • This is convenient since dataset is not very large, otherwise, it should be stored in several tables as SQL database.

  • add party of contributions

Data Overview

Dataframe contributions

  • contributions is the main dataframe for exploration
##   candidate                  employer        occupation amount       date
## 1   Clinton                        NA           Retired     50 2016-04-26
## 2   Clinton                        NA           Retired    200 2016-04-20
## 3   Clinton                        NA           Retired      5 2016-04-02
## 4   Sanders At&T Government Solutions Software Engineer     40 2016-03-04
## 5   Sanders  Vericor Enterprises Inc.        Pharmacist     35 2016-03-05
## 6   Sanders  Vericor Enterprises Inc.        Pharmacist    100 2016-03-06
##            county    party
## 1           Marin Democrat
## 2 San Luis Obispo Democrat
## 3  San Bernardino Democrat
## 4         Ventura Democrat
## 5     Los Angeles Democrat
## 6     Los Angeles Democrat
##   candidate           employer          occupation            amount        
##  Length:1125659     Length:1125659     Length:1125659     Min.   :-10500.0  
##  Class :character   Class :character   Class :character   1st Qu.:    15.0  
##  Mode  :character   Mode  :character   Mode  :character   Median :    27.0  
##                                                           Mean   :   121.8  
##                                                           3rd Qu.:    97.0  
##                                                           Max.   : 10800.0  
##       date               county             party          
##  Min.   :2013-11-05   Length:1125659     Length:1125659    
##  1st Qu.:2016-02-29   Class :character   Class :character  
##  Median :2016-05-03   Mode  :character   Mode  :character  
##  Mean   :2016-05-03                                        
##  3rd Qu.:2016-08-01                                        
##  Max.   :2016-10-19
##  candidate   employer occupation     amount       date     county      party 
##          0          0          0          0          0      14246          0
  • votes contains the background information and election results by county
##      county Clinton Trump family_income population
## 1   Alameda  300185 59384         90822    1510271
## 2    Alpine     318   211         71932       1175
## 3    Amador    3420  6069         68765      38091
## 4     Butte   31638 35354         56934     220000
## 5 Calaveras    6225 10126         67100      45578
## 6    Colusa    1832  2709         56472      21419

Univariate Plots Section

In this part, we will use the univariate plots to show basic statistics of the data.

Group and summarize dataframe

  • It’s important to group and summarize columns for detailed investigation.
  • It will be used both in univariate and bivariate plots.

Contribution count by date

  • numbers of contributions has two peaks at 16/04 and 16/10

Contribution amount distribution

  • most of the contribution amount is winthin $100, with some peaks at $250 and $500

Countribution count by county

  • group and summarize contribution by county, then merge with votes

  • counties with top counts of contribution are Los Angeles, San Diego, Alameda, San Francisco and Orange.

  • they are also among the counties with top 10 population

Contribution count by candidate

  • group and summarize contribution by candidate, then merge with candidates

  • The Democrat receives most counts of contributions in CA, more than four times of the total of others.

Contribution count by employer and occupation

  • There are large number of categories and NA entries for employer data.

numbers of unique employer and NA

## [1] 57976
## [1] 19951
  • But we can still see the patterns that most counts of contribution come from the unemployed, self-employed, retired, self-employed, unversity stuff or student, as well as software and Internet industry.

  • The situation of contribution count vs occupation is similar to employer, with the variety decreases nealy by a half.

unique numbers of occupation type and NA

## [1] 25385
## [1] 7239
  • The top occupations are the retired, unemployed, attorney, enginneer, professor, homemaker, self-employed, students, etc.

Univariate Analysis

What is the structure of your dataset?

  • The source data are within three related dataframes, the details can be found at the data wrangling section.
  • The contributions contains all contribution related information.
  • The votes table contains background information such as income, population by county. It also has the votes result of presendential election of CA by county.

What is/are the main feature(s) of interest in your dataset?

  1. What’s the rough distribution of contribution counts over employer, occupation, date?
  • We need the basics statistics to figure out the more import factors for following analysis.
  • For example, we identify there are interesting and maybe important patterns for amount histogram and amount vs date. They may be closely related with the trends of overall process of campaign.
  • Meanwhile, contribution vs occupation or employer suffers from low quality problem, so we will focus only the top ranking and more robust data.
  1. What’s the popular candidate, county?
  • We want to figure this part out as the core of campaign is about attracting financial resources and popularity by each candidate/party, in each county.
  • This may have relation with the final electionr results that we will study soon.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

  1. We need background of the residents, mainly their incomes and local population.
  2. We also want to find some relationship between the contribution and election result, so the votes are also necessary.

Did you create any new variables from existing variables in the dataset?

  • Yes.
  • Most of the data analysis is within table contributions and dataframes grouped and summarized based on it, for example contributions.by_candidate, contributions.by_couty, and contributions.by_date.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

  • There are some regular data wrangling problems in first section, which as been well solved and documented.
  • In the plotting part, we observe several patterns that worth further study.
  • The distribution of contribution amount and count vs county, count vs candidate is very unbalanced, namely the top 3 could make up 80% of total value.
  • For the contribution count vs time and count vs candidate, we know as the campaign moved on, most candidates withdrew and only two of them left. When did they start to draw attention both in contribution and popularity? We will need to investigate that.

Bivariate Plots Section

In this part, we further analyze the datasets, focusing on the following questions.

Questions and Plots

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

  • There’s a positive correlation between contribution preference and voting preference.
  • Namely, Democrat contributers tend to vote for Clinton, Republican supporters for Trump.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

  • The other parties or candidates have overall little influence in the campaign.
  • The Democats and Republicans almost start the campaign at the same time, yet Democats manage to rush all the way until near election.
  • Republican’s final candidate Trump didn’t get much contribution, on the contrary, he could the cause of the slow down of Republican’s campaign starting from April 2016.

What was the strongest relationship you found?

  • The positive correlation between average contribution preference and voting preference.

Multivariate Plots Section

In this section, we will focus on study of final candidates Hillary and Trump, or the Democrat and Republican.

Which candidates/party got the most of the campaign?

  • The plot confirms that the Deomocrat receives most contributions in CA.
  • The ratio between two final candidates Clinton and Trump is more than 10 times.

What is the overall process of the financial campaign?

  • The change of total contribution of different parties shows Democrat have a strong increasing campaign trend.
  • For the Republican, it slows down by a small amount since April 2016.

How much do supporters contribute to each party?

  • Below is the scatter plot the average contribution amount diveded by average family income of each county.
  • It shows that Democrat supporters tend to donate more money than the Republic Supporters.

What’s the contribution preference of people with different occupation?

  • The following is barplots of the top 20 occupations and their contributions to each party, by count and by amount seperately.
  • Most people have strong preference of the Democrat over the Republic, with the exception of those who are homemaker, retired, enginner, sales, ceo, and self-employed.
  • The Repulican wins competive portion financial contribution from those who are homemaker, retired, CEO, sales, and lawer.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

  1. For the financial campaign in CA, the Democrat receives more contributions than the Republican, either by candidates, by party, or by time.
  2. The Democrat supporters tend to donate more money than the Republic supporters.
  3. Most people have strong preference of the Democrat over the Republic, with the exception of those who are homemaker, retired, enginner, sales, ceo, and self-employed.
  4. Most people have preference of the Democrat over the Republic, while the Repulican wins competive portion financial contribution from those who are homemaker, retired, CEO, sales, and lawer.

Were there any interesting or surprising interactions between features?

  1. Republican supporters at CA has a very low and stable relative donation amount on family income, while for Democrat supporters, there is a large variance, which implies their income is higher than the average. It could be due to the large ratio of rich people in the Democat supporters.
  2. There are obvious patterns in the supporters by occupations. We can identify that the strong suppoters of the Democrat are people in public services and high incoming industry, such as professor, software enginner and artist.
  3. For the Republican, the strong supporters are the low/middle class such as homemaker, self employed, retired, sales, engineer. However, there are also a large amount of contributiuon from ceo, consultant and lawyer.

Final Plots and Summary

Plot One

  • contribution count vs candiate and party

Description One

  • The plot shows us which candidates receive the most contributions by count from of the campaign.
  • The Democrat receives most contributions in CA, the Republican comes at the second, other parties have little share.
  • The ratio of contribution amount received between the final candidates Clinton and Trump is about 10 times.

Plot Two

  • vote preference vs contribution preference by county

Description Two

  • The plot shows positive correlation between contribution preference and voting preference.
  • Namely, Democrat contributers tend to vote for Clinton, and Republican supporters for Trump.

Plot Three

  • contribution among top occupations of different party

Description Three

  • The plot shows us what’s the contribution preference of people with different occupation.
  • Most people have overwelming preference of the Democrat over the Republic, both by contribution count and total amount.
  • However, there are a few interesting exceptions that the Repulican wins competive portion financial contribution from those who are homemaker, retired, CEO, sales, and lawer. More data about the contributers is needed for further intepretation of this difference.

Reflection

Conclusion

  • In this project, we will explore the financial contributions to 2016 US Presidential candidates of California (CA).
  1. We start with collecting the data for contributions and addition data for background information of counties in CA.
  2. Then the data is audited, cleaned, and most importantly, joined and merged into dataframes for later exploration, including contributions, votes, and so on.
  3. We first investigate univariate cases to gain a deeper understanding of the data. After that we continue with bivariate and multivariate analysis, focusing on questions of the campaign process of each candidate/party, and its relationship with average family incoming and occupations, as well as its relationship with the final election votes of each county.

Challenges

  1. The main challenge is the data wrangling process, R used different methods and utilities to wrangle data. It could be inflexible and a little painful as my preious experience is mainly in Python.
  2. The challenges for EDA process are designing questions and strategy, and collecting source data. They will decide the goal, process and quality of the analysis to a great extent. For example, with the help the contributions, counties, and votes data, we are enable to explore the inner relationship between the voters’ background, their contributions and election results.

Future work

  1. The analysis could be extended to the scale of the national campaign instead of a single state, it will give us a complete view of the campaign for each candidate/party.
  2. As we the election has ended and votes data is available. We can also get the popularity suvey result across the campaign and election. With them we may be able to find stronger connections between campaign, popularity, and election result.
  3. Similarly, we can combine detailed background information of each county, such as eduction level, economics statistics, historical data on elections, etc.

Reference